Audience: Diverse Background
Time: 1 day workshop (6 hours)
Pre-Requisites: Completion of NLP Introductory course developed by the Data Science Campus team and fulfillment of all prerequisites stated there.
Brief Description: This course will focus on three key topics in Natural Language Processing: information retrieval, classification and sentiment analysis. The information retrieval section covers the building blocks of a search engine - the inverted index - and maps out in detail, with both illustrations and code, how an information retrieval application can be built. Three disparate, classical approaches will be examined to fulfil this objective. Classification will then be outlined, focusing on its supervised machine learning foundations. A real-world classification problem of news classification will be illustrated using a BBC news dataset. The course will conclude with another look at classification from the challenging field of sentiment analysis.
Aims, Objectives and Intended Learning Outcomes: This module will map out salient features and challenges in information retrieval and text classification. Learners should attain competency in building information retrieval applications and applying text classification techniques to key problem domains such as sentiment analysis. By the end of the module, learners should be able to apply the tools and methods taught from the three main approaches to tackle an information retrieval task. Similarly, learners should become conversant with the basics of a supervised machine learning task and how it maps to a text classification problem. They should again be able to apply the tools and techniques taught to solve real-world problems such as news classification.
Dataset: BBC News headline Dataset, Airline tweets sentiments Dataset, IATI (descriptions on aid activity) dataset
Libraries: Before attending the course please make sure that you read the course instructions that you received.
Acknowledgements: Many thanks to Isabela Breton and Dan Lewis for reviewing the material, and to Ceri Regan for helping to roll out the course to graduates. Many thanks to the Data Science Campus team based at Abercrombie House, East Kilbride for also reviewing the course and commentary. Thanks also to everyone who attended the pilot course and provided feedback.
Intended Learning Outcomes: By the end of Chapter 2 you should be able to:-
Define key terms in Information Retrieval (IR).
List at a high level of abstraction key steps in developing an IR application.
Describe how IR can be challenging
Describe an Inverted Index
Set up an inverted index for a document collection in Python using scikit-learn
Define 3 models used to build an IR application
Describe the Boolean Retrieval Model
Set up a Boolean Retrieval search over a document collection
Describe VSM approach to IR
Set up a VSM based IR program over a document collection
Describe Language Modelling approach to IR
Calculate maximum likelihood estimates for terms in a document collection.
Apply linear interpolation to determine a probability score for a query/document pair.
The need for IR applications
Amount of digital data in 2007: 281 exabytes = 281 trillion digitized novels
“Every 2 days now, we create as much information as we did from the dawn of civilisation up until 2003”
Eric Schmidt
The meaning of the term Information Retrieval can be quite broad.
Every time you look up information to get a task done could be considered IR.
A useful definition given by Manning (2009):
IR is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers)
Key Terms used in Information Retrieval
An information need is the topic about which the user desires to know more about.
A query is what the user conveys to the computer in an attempt to communicate the information need.
A document is an information entity the user wants to retrieve.
A document is relevant if the user perceives that it contains information of value with respect to their personal information need.
A collection is a set of documents.
A term is a word or concept that appears in a document or query.
An index is a representation of information that makes querying easier.
Information Retrieval vs Web Search
IR is more than web search
IR is concerned with the finding of (any kind of) relevant information
Up until a few decades ago, people preferred to get information from other people, e.g. booking travel via a human travel agent, or asking librarians and paralegals to search for books and documents. It used to be an activity only a few people engaged in.
The world has changed: hundreds of millions of people engage in information retrieval (IR) every day through web search. However, many other cases of IR, e.g. email search, searching your laptop, and interrogating corporate knowledge bases, are also commonplace examples of search.
Information retrieval has overtaken database retrieval as most information does not reside in database systems.
Related to the above are the following issues:
Questions to tackle in retrieval
The task in information retrieval is this: we have vast amounts of information to which accurate and speedy access is becoming ever more difficult. One effect of this is that relevant information gets ignored because it is never uncovered, which in turn leads to much duplication of work and effort. With the advent of computers, a great deal of thought has been given to using them to provide rapid and intelligent retrieval systems. The notion of relevance is at the centre of information retrieval. The purpose of an automatic retrieval strategy is to retrieve all the relevant documents while retrieving as few of the non-relevant documents as possible. An IR system should generate a ranking which reflects relevance.
Most search engines use a bag-of-words representation to build retrieval models: the document is treated as an unordered bag of its words.
Basic Concept: Each document is described by a set of representative keywords called index terms.
Assign a numerical weight to index terms
The above index is often represented as a dictionary file of terms with an associated postings file.
This inverted index structure is essentially without rivals as the most efficient structure for supporting ad hoc text search.
import re
import pandas as pd
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

inverted_index_example = ["He likes to wink, He likes to drink!", "He likes to drink, and drink, and drink.", "The thing he likes to drink is ink", "The ink he likes to drink is pink", "He likes to wink, and drink pink ink"]

def set_tokens_to_lowercase(data):
    for index, entry in enumerate(data):
        data[index] = entry.lower()
    return data

def remove_punctuation(data):
    # replace every non-word character with a space
    for index, entry in enumerate(data):
        data[index] = re.sub(r'[^\w]', ' ', entry)
    return data

def remove_stopwords_from_tokens(data):
    stop_words = set(stopwords.words("english"))
    new_list = []
    for entry in data:
        no_stopwords = ""
        for word in entry.split():
            if word not in stop_words:
                no_stopwords = no_stopwords + " " + word
        new_list.append(no_stopwords)
    return new_list

inverted_index_example = remove_stopwords_from_tokens(remove_punctuation(set_tokens_to_lowercase(inverted_index_example)))
vectorizer = CountVectorizer()
inverted_index_vectorised = vectorizer.fit_transform(inverted_index_example)
# inspect the term-document matrix
tdm = pd.DataFrame(inverted_index_vectorised.toarray(), columns = vectorizer.get_feature_names())
print (tdm.transpose())
## 0 1 2 3 4
## drink 1 3 1 1 1
## ink 0 0 1 1 1
## likes 2 1 1 1 1
## pink 0 0 0 1 1
## thing 0 0 1 0 0
## wink 1 0 0 0 1
The following can be said about the inverted index:-
• It maps terms to the documents that contain them. It “inverts” the collection (which maps documents to the words they contain)
• It permits us to answer Boolean queries without visiting the entire corpus
• It is slow to construct (requires visiting entire corpus) but this only needs to be done once
• It can be used for any number of queries
• It can be done before any queries have been seen
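The scikit-learn approach above produces a term-document matrix; as a minimal sketch of the underlying structure itself, an inverted index can also be built directly as a dictionary mapping each term to the list of documents that contain it. The pre-tokenised document list and the helper name `build_inverted_index` below are illustrative:

```python
from collections import defaultdict

docs = [
    "he likes to wink he likes to drink",
    "he likes to drink and drink and drink",
    "the thing he likes to drink is ink",
    "the ink he likes to drink is pink",
    "he likes to wink and drink pink ink",
]

def build_inverted_index(documents):
    # map each term to the sorted list of document ids that contain it
    index = defaultdict(set)
    for doc_id, text in enumerate(documents):
        for term in text.split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

inverted_index = build_inverted_index(docs)
print(inverted_index["ink"])   # -> [2, 3, 4]
```

A Boolean query can then be answered by intersecting the postings lists of the query terms, without visiting the full corpus.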
Exercise:
Doc 1: New home sales top forecasts.
Doc 2: Home sales rise, in July!
Doc 3: Increase in home sales, in July.
Doc 4: July new home sales rise.
Optional Extra:
Find documents matching the query “pink ink”, where both words have to appear as a phrase.
We could have a bi-gram index
Bi-gram index issues:
Fast but index size will explode
What about trigram phrases?
What about proximity? e.g. “ink is pink”
A possible solution: Proximity Index
Term positions are embedded in the inverted index
This is called a proximity/positional index
Enables phrase and proximity search
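A positional index can be sketched as a nested dictionary, with a phrase query answered by checking for adjacent positions. The documents (the earlier drink/ink collection) and the helper names `build_positional_index` and `phrase_match` below are illustrative:

```python
from collections import defaultdict

docs = [
    "he likes to wink he likes to drink",
    "he likes to drink and drink and drink",
    "the thing he likes to drink is ink",
    "the ink he likes to drink is pink",
    "he likes to wink and drink pink ink",
]

def build_positional_index(documents):
    # term -> {doc_id: [positions of the term within that document]}
    index = defaultdict(dict)
    for doc_id, text in enumerate(documents):
        for pos, term in enumerate(text.split()):
            index[term].setdefault(doc_id, []).append(pos)
    return dict(index)

def phrase_match(index, first, second):
    # documents where `second` occurs immediately after `first`
    hits = []
    for doc_id, positions in index.get(first, {}).items():
        later = index.get(second, {}).get(doc_id, [])
        if any(p + 1 in later for p in positions):
            hits.append(doc_id)
    return sorted(hits)

pos_index = build_positional_index(docs)
print(len(pos_index["ink"]))                   # document frequency of "ink" -> 3
print(phrase_match(pos_index, "pink", "ink"))  # phrase query "pink ink" -> [4]
```

Note how document 3 contains both “pink” and “ink” but is rejected by the phrase query, because the positions are not adjacent.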
Implement a positional inverted index on the data shown below.
You need to save the following information in the terms' inverted lists:
- term (pre-processed) and its document frequency
- list of documents where this term occurred
- for each document, a list of positions where the term occurred within the document
Doc 1: breakthrough drug for schizophrenia
Doc 2: new schizophrenia drug
Doc 3: new approach for treatment of schizophrenia
Doc 4: new hopes for schizophrenia patients
To retrieve relevant documents effectively, IR strategies typically transform the documents into a suitable representation. Each retrieval strategy incorporates a specific model for its document representation purposes.
A retrieval model specifies the details of:
• Document representation
• Query representation
• Retrieval function: how to find relevant results
• Determines a notion of relevance
In classical IR models a document is described as a set of representative keywords - index terms. Each term is assigned a numerical weight to determine relevance.
The simplest form of document retrieval is for a computer to do a linear scan through the documents. This process is commonly referred to as grepping through text, after the Unix command grep, which performs this process. However, searching through large collections (billions to trillions of words) is unacceptably slow, and more flexible matching operations require ranked retrieval.
One alternative to linearly scanning is to index the documents in advance.
Suppose we record for each document – here a play of Shakespeare’s – whether it contains each word out of all the words Shakespeare used (about 32,000 different words). The result is a binary term-document incidence matrix, as in the figure. Terms that are indexed are usually words.
We can have a vector for each term, which shows the documents it appears in, or a vector for each document, showing the terms that occur in it. To answer the query Brutus AND Caesar AND NOT Calpurnia, we take the vectors for Brutus, Caesar and Calpurnia, complement the last, and then do a bitwise AND: 110100 AND 110111 AND 101111 = 100100.
The answers for this query are thus Antony and Cleopatra and Hamlet.
Question
What are the returned results for the query drink AND ink AND NOT pink?
data = ["He likes to wink, He likes to drink!", "He likes to drink, and drink, and drink.", "The thing he likes to drink is ink","The ink he likes to drink is pink","He likes to wink, and drink pink ink" ]
data = remove_stopwords_from_tokens(remove_punctuation(set_tokens_to_lowercase(data)))
binary_vectorizer = CountVectorizer(binary=True)
counts = binary_vectorizer.fit_transform(data)
# inspect the binary term-document matrix
tdm = pd.DataFrame(counts.toarray(), columns = binary_vectorizer.get_feature_names())
tdm=tdm.transpose()
print (tdm)
## 0 1 2 3 4
## drink 1 1 1 1 1
## ink 0 0 1 1 1
## likes 1 1 1 1 1
## pink 0 0 0 1 1
## thing 0 0 1 0 0
## wink 1 0 0 0 1
def NOT(pterm):
    # flip each bit of the term's incidence vector
    for a in range(len(pterm)):
        pterm[a] = 1 - pterm[a]
    return pterm

term1 = list(tdm.loc['drink'])
term2 = list(tdm.loc['ink'])
term3 = NOT(list(tdm.loc['pink']))
terms = list(zip(term1, term2, term3))
vector = [t1 & t2 & t3 for t1, t2, t3 in terms]
for doc_id, match in enumerate(vector):
    if match == 1:
        print ("Document", doc_id, "meets search term criteria")
## Document 2 meets search term criteria
The Boolean retrieval model is a model for information retrieval in which we can pose any query that is in the form of a Boolean expression of terms, that is, in which terms are combined with the operators AND, OR, and NOT. The model views each document as just a set of words.
The following can be said of the Boolean Retrieval Model:-
• It can answer any query that is made up of boolean expressions
• Boolean queries are queries that use and, or and not to join query terms
• Views each document as a set of terms
• It is precise - document matches conditions or not
• Primary commercial retrieval tool for 3 decades
• Many professional searchers (e.g., lawyers) still like boolean queries
• You know exactly what you are getting
• It does not have a built-in way of ranking matched documents by some notion of relevance
• It is easy to understand. Clean formalism
• It is too complex for web users
• Incidence matrix is impractical for big collections
Exercise:
Consider these documents:
Doc 1: breakthrough drug for schizophrenia
Doc 2: new schizophrenia drug
Doc 3: new approach for treatment of schizophrenia
Doc 4: new hopes for schizophrenia patients
The representation of a set of documents as vectors in a common vector space is known as the Vector Space Model. Every distinct word has one dimension.
Key idea: Documents and queries are vectors in a high-dimensional space.
Key issues:
• What to select as the dimensions of this space?
• How to convert documents and queries into vectors?
• How to compare queries with documents in this space?
The Vector Space Model assumes that
• the degree of matching can be used to rank-order documents;
• this rank-ordering corresponds to how well a document satisfies a user’s information need
Steps in Vector Space Modelling
• Convert the query to a vector of terms
• Weight each component.
• Consult the index to find all documents containing each term
• Convert each document to a weighted vector
• Query and documents mapped to vectors and their angles compared
• Match the query vector against each document vector and sort the documents by their similarity
• Similarity based on occurrence frequencies of keywords in query and document
• Output documents are ranked according to similarity to query
Challenges
• Finding a good set of basis vectors.
• Finding a good weighting scheme for terms, since the model provides no guidance.
Usually variations on (length normalised) tf*idf
• Finding a comparison function, since again the model provides no guidance. Usually cosine comparison.
Comments on Vector Space Models
• Simple, practical, and mathematically based approach
• Lacks the control of a Boolean model (e.g., requiring a term to appear in a document)
Overall, Vector Space Models are hard to beat
Consider below documents and a query term
Document 1: Cat runs behind rat
Document 2: Dog runs behind cat
Query: rat
A term-document matrix would be set up. This is a way of representing document vectors in a matrix format, in which each row represents a term vector across all the documents and each column represents a document vector across all the terms.
Term weights are calculated for all the terms in the matrix across all the documents.
A word which occurs in most of the documents might not contribute to document relevance, whereas less frequently occurring terms might define document relevance. This can be captured using a method known as term frequency - inverse document frequency (tf-idf), which gives higher weights to terms which occur often in a document but rarely across the other documents, and lower weights to terms which occur commonly within and across all the documents. tf-idf = tf × idf
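As a rough sketch of this weighting, using raw relative term frequency and the plain log idf (library implementations such as scikit-learn's TfidfVectorizer apply additional smoothing and normalisation, so their numbers differ slightly), with the two documents from above:

```python
import math

docs = [
    "cat runs behind rat",
    "dog runs behind cat",
]

def tf_idf(term, doc, documents):
    # tf: relative frequency of the term within the document
    tf = doc.split().count(term) / len(doc.split())
    # idf: log of (number of documents / documents containing the term)
    df = sum(1 for d in documents if term in d.split())
    idf = math.log(len(documents) / df)
    return tf * idf

# "rat" appears only in document 1, so it gets a non-zero weight there
print(tf_idf("rat", docs[0], docs))   # ≈ 0.173
# "runs" appears in every document, so idf = log(1) = 0
print(tf_idf("runs", docs[0], docs))  # 0.0
```

This shows the behaviour described above: the discriminating term “rat” is weighted up, while “runs”, common to every document, is weighted down to zero.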
Similarity Measures: cosine similarity
Mathematically, closeness between two vectors is measured by the cosine of the angle between them. To find the documents relevant to a query, the similarity score between each document vector and the query vector is calculated by applying cosine similarity. Whichever documents have a high similarity score are considered relevant to the query term.
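Using the two example documents and the query “rat” from above, the cosine calculation can be sketched with plain count vectors (the vocabulary ordering below is illustrative):

```python
import math

def cosine(u, v):
    # cosine of the angle between vectors u and v: dot(u, v) / (|u| * |v|)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# vocabulary: ["behind", "cat", "dog", "rat", "runs"]
d1 = [1, 1, 0, 1, 1]   # "cat runs behind rat"
d2 = [1, 1, 1, 0, 1]   # "dog runs behind cat"
q  = [0, 0, 0, 1, 0]   # query "rat"

print(cosine(q, d1), cosine(q, d2))  # 0.5 0.0
```

Document 1 scores 0.5 against the query while document 2 scores 0, so document 1 would be ranked as the relevant one.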
Summary on VSM
The IATI dataset will be used; further details on this dataset can be found at https://iatistandard.org/en/iati-standard/. The dataset used below is a subset which provides descriptions of aid activity undertaken by various organisations in the aid sector around the world.
import operator
import pandas as pd
import re
import sklearn
from sklearn.decomposition import PCA
from sklearn import feature_extraction
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from nltk import ngrams
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.corpus import brown
from nltk.collocations import *
from nltk.corpus import webtext
import numpy as np
import random
import pickle
from sklearn.metrics.pairwise import cosine_similarity
def set_tokens_to_lowercase(data):
    for index, entry in enumerate(data):
        data[index] = entry.lower()
    return data

def remove_punctuation(data):
    # replace every non-word character with a space
    for index, entry in enumerate(data):
        data[index] = re.sub(r'[^\w]', ' ', entry)
    return data

def remove_stopwords_from_tokens(data):
    stop_words = set(stopwords.words("english"))
    new_list = []
    for entry in data:
        no_stopwords = ""
        for word in entry.split():
            if word not in stop_words:
                no_stopwords = no_stopwords + " " + word
        new_list.append(no_stopwords)
    return new_list

def stemming(data):
    st = PorterStemmer()
    for index, entry in enumerate(data):
        data[index] = st.stem(entry)
    return data
def read_data():
    raw_data_orig = pd.read_csv("C:/IR Course/Adv -IR/IATI.csv")
    # fix the random state so repeated calls return the same sample
    raw_data_orig = raw_data_orig.sample(500, random_state=42)
    #raw_data_orig = open("C:/IR Course/Adv -IR/IATI3.pkl","rb")
    #raw_data_orig = pickle.load(raw_data_orig, encoding='iso-8859-1')
    raw_data_orig = raw_data_orig[raw_data_orig['description'].notnull()]
    return raw_data_orig
query = "climate change and environmental degradation"

def preprocess(pdf):
    for index, row in pdf.iterrows():
        # assign via .at - mutating the row returned by iterrows() does not update the dataframe
        pdf.at[index, 'description'] = " ".join(stemming(remove_stopwords_from_tokens(remove_punctuation(set_tokens_to_lowercase(row['description'].split(" "))))))
    return pdf
#preprocess documents
raw_data= preprocess(read_data())
#now preprocess query
query = " ".join(stemming(remove_stopwords_from_tokens(remove_punctuation(set_tokens_to_lowercase(query.split(" "))))))
rownames = raw_data["iati-identifier"]
#vectorise and get tfidf values
vectorizer = TfidfVectorizer()
vectorized_iati = vectorizer.fit_transform(raw_data["description"])
tdm = pd.DataFrame(vectorized_iati.toarray(), columns = vectorizer.get_feature_names())
tdm=tdm.set_index(rownames)
#now vectorise query
vectorized_query=vectorizer.transform(pd.Series(query))
query = pd.DataFrame(vectorized_query.toarray(), columns = vectorizer.get_feature_names())
# get cosine similarity
def cos_sim(pdf, qdf):
    f_similarity = {}
    for index, row in qdf.iterrows():
        for index2, row2 in pdf.iterrows():
            cos_result = cosine_similarity(np.array(row).reshape(1, row.shape[0]), np.array(row2).reshape(1, row2.shape[0]))
            f_similarity[index2] = round(float(cos_result), 5)
    return f_similarity
cosine_scores=cos_sim (tdm, query)
#now rank
final_rank= sorted(cosine_scores.items(), key=operator.itemgetter(1), reverse=True)
final_rank = final_rank[0:5]
rownames = rownames.tolist()
unprocessed = read_data()
for item in final_rank:
    if item[0] in rownames:
        print('IATI-IDENTIFIER {0} DESCRIPTION {1}'.format(item[0], unprocessed.iloc[rownames.index(item[0]), 2]))
## IATI-IDENTIFIER 41AAA-11960-011 DESCRIPTION UNFPA Myanmar Strengthened capacities to effectively forecast, procure, distribute and track the delivery of sexual and reproductive health commodities, ensuring resilient supply chains activities
## IATI-IDENTIFIER 41AAA-21339-002 DESCRIPTION UNFPA Colombia other-funded Activities to increase capacity to prevent gender-based violence and harmful practices and enable the delivery of multisectoral services, including in humanitarian settings activities implemented by UNFPA
## IATI-IDENTIFIER 41AAA-20342-001 DESCRIPTION UNFPA Botswana regular-funded Activities to increase availability of evidence through cutting-edge in-depth analysis on population dynamics, sexual and reproductive health, HIV and their linkages to poverty eradication and sustainable development activities implemented by NGO
## IATI-IDENTIFIER 41AAA-21192-001 DESCRIPTION UNFPA Guinea-Bissau Enhanced capacities to develop and implement policies, including financial protection mechanisms, that prioritize access to information and services for sexual and reproductive health and reproductive rights for those furthest behind, including in humanitarian settings activities
## IATI-IDENTIFIER 41120-3110 DESCRIPTION UNFPA Sudan other-funded Activities to increase national capacity to strengthen enabling environments, increase demand for and supply of modern contraceptives and improve quality family planning services that are free of coercion, discrimination and violence activities implemented by NGO
Exercise:
From the IATI10k.csv file, extract a sample of records (for example 100 rows) then do the following:
1. Put the description column through pre-processing. Make a decision on what preprocessing routines would be suitable.
2. Set up a suitable query to interrogate the document collection.
3. Construct the inverted index with tf-idf scores. Ensure that the query has also been converted to a vector with tf-idf scores.
4. Compare the query vector with all the other vectors in the document collection and calculate the cosine similarity. Then store the iati-identifier field as a key with the cosine score as a value in a Python dictionary.
5. Rank the dictionary by cosine scores (the value field in the dictionary) and print the top 10 scores (sorted in descending order).
Use probability to determine relevance. How well does a document satisfy the query ?
An IR system has an uncertain understanding of the user query and makes an uncertain guess of whether a document satisfies the query.
Probability theory provides a principled foundation for such reasoning under uncertainty
The query and the documents are all observations from random variables. In the vector-based models, we assumed they were vectors, but here we assume they are data observed from random variables.
And so, the problem of retrieval becomes to estimate the probability of relevance
In this category of models, there are different variants.
Classical probabilistic retrieval models
Binary Independence Model
Okapi BM25
Bayesian networks for text retrieval
Language model approach to IR
Probability Ranking Principle
In query likelihood, our assumption is that this probability of relevance can be approximated by the probability of query given a document and relevance.
How do we compute this conditional probability?
This is where we build a Language Model.
What is a language model ?
“The goal of a language model is to assign a probability to a sequence of words by means of a probability distribution” –Wikipedia
To understand what a language model is, you must know what the following are:
• probability distribution
• discrete random variable
In a unigram language model we estimate (and predict) the likelihood of each word independent of any other word
Defines a probability distribution over individual words
Sequences of words can be assigned a probability by multiplying their individual probabilities:
P(university of north carolina) = P(university) x P(of) x P(north) x P(carolina) = (2/20) x (4/20) x (2/20) x (1/20) = 0.0001
There are two important steps in language modeling
‣ estimation: observing text and estimating the probability of each word
‣ prediction: using the language model to assign a probability to a span of text.
General estimation approach:
‣ tokenize/split the text into terms
‣ count the total number of term occurrences (N)
‣ count the number of occurrences of each term (tft)
‣ assign term t a probability equal to tft / N
• Suppose we have a document D, with language model θD
• We can use this language model to determine the probability of a particular sequence of text
• How? We multiply the probability associated with each term in the sequence!
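The estimation and prediction steps can be sketched as follows. This is plain maximum likelihood with no smoothing yet, and the example document and helper names are illustrative:

```python
from collections import Counter

def estimate_lm(text):
    # estimation: maximum likelihood estimate P(t) = tf_t / N
    tokens = text.split()
    counts = Counter(tokens)
    n = len(tokens)
    return {term: tf / n for term, tf in counts.items()}

def sequence_probability(lm, sequence):
    # prediction: multiply the probability of each term; unseen terms give 0
    prob = 1.0
    for term in sequence.split():
        prob *= lm.get(term, 0.0)
    return prob

doc = "he likes to drink and drink and drink"
lm = estimate_lm(doc)
print(lm["drink"])                            # 3/8 = 0.375
print(sequence_probability(lm, "drink ink"))  # 0.0, since "ink" is unseen
```

The zero for “drink ink” illustrates the problem smoothing addresses below: a single unseen query term wipes out the whole score.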
Question:
What is the probability given by this language model to the sequence of text “rocky is a boxer” or “a boxer is a pet”?
To summarise: how is the document model estimated for each document?
• Objective: rank documents based on the probability that they are on the same topic as the query
• Solution:
‣ Score each document (denoted by D) according to the probability given by its language model to the query (denoted by Q)
‣ Rank documents in descending order of score
Every document in the collection is associated with a language model
• Let θD denote the language model associated with document D
• Think of a “black-box”: given a word, it outputs a probability
Let P(t|θD) denote the probability given by θD to term t
Question:
Which would be the top-ranked document and what would be its score?
P(q|M1) > P(q|M2)
There are (at least) two issues with scoring documents based on query terms
A document with a single missing query-term will receive a score of zero (similar to boolean AND)
• Where is IDF?
• No attempt is made to suppress the contribution of terms that are frequent in the document but also frequent in general (i.e. appear in many documents).
• The goal of smoothing is to …
‣ Decrease the probability of observed outcomes
‣ Increase the probability of unobserved outcomes
Add One Smoothing
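A brief sketch of add-one (Laplace) smoothing over a document's term counts: one is added to every count so that unseen vocabulary terms get a non-zero probability. The example document and vocabulary below are illustrative:

```python
from collections import Counter

def add_one_lm(text, vocabulary):
    # P(t) = (tf_t + 1) / (N + |V|): add one to every count,
    # so unseen vocabulary terms get non-zero probability
    counts = Counter(text.split())
    n = len(text.split())
    v = len(vocabulary)
    return {term: (counts[term] + 1) / (n + v) for term in vocabulary}

vocab = ["he", "likes", "to", "drink", "ink", "pink"]
lm = add_one_lm("he likes to drink", vocab)
print(lm["drink"], lm["ink"])  # 0.2 0.1 - seen vs unseen term
```

Note that the probabilities still sum to one: the mass taken from observed terms is redistributed to the unobserved ones, exactly the behaviour described in the two bullets above.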
A more effective approach to smoothing for information retrieval is called linear interpolation
• Let θD denote the language model associated with document D
• Let θC denote the language model associated with the entire collection
• Using linear interpolation, the probability given by the document language model to term t is: P(t|θD) = λ·P(t|D) + (1 − λ)·P(t|C)
As before, a document’s score is given by the probability that it “generated” the query
• As before, this is given by multiplying the individual query-term probabilities
• However, the probabilities are obtained using the linearly interpolated language model
Without smoothing, the query-likelihood model ignores how frequently the term occurs in general!
import nltk
import sys
import codecs
from nltk.corpus import stopwords
import csv
import pandas
import re
import numpy as np
df = pandas.read_csv('C:/IR Course/Adv -IR/IATI10k.csv', header = 0, encoding="iso-8859-1")
df = df[df.description.notnull()]
def set_tokens_to_lowercase(data):
    for index, entry in enumerate(data):
        data[index] = entry.lower()
    return data

def remove_punctuation(data):
    # replace every non-word character with a space, then strip the result
    for index, entry in enumerate(data):
        data[index] = re.sub(r'[^\w]', ' ', entry).strip()
    return data

def remove_stopwords_from_tokens(data):
    stop_words = set(stopwords.words("english"))
    stop_words.add(" ")
    new_list = []
    for entry in data:
        if entry not in stop_words:
            new_list.append(entry)
    return new_list
def clean_df(pdf):
    for index, row in pdf.iterrows():
        # assign via .at - mutating the row returned by iterrows() does not update the dataframe
        cleaned = remove_stopwords_from_tokens(remove_punctuation(set_tokens_to_lowercase(row['description'].split())))
        pdf.at[index, 'description'] = " ".join(cleaned)
    return pdf
def calc_docscore(pdf, pqry):
    col_names = ['Description', 'score']
    f_df2 = pandas.DataFrame(columns = col_names)
    for index, row in pdf.iterrows():
        rank = []
        docscore = 0
        scored = score(row['description'])
        for word in pqry.split(" "):
            p_doc = float(scored.get(word, 0))              # P(word | document)
            p_coll = float(allcounts.get(word, 0)) / total  # P(word | collection)
            rank.append((p_doc + p_coll) / 2)               # linear interpolation, lambda = 0.5
        if rank != []:
            docscore = np.prod(np.array(rank))
        f_df2.loc[index] = pandas.Series({'Description': row['description'], 'score': docscore})
    return f_df2
def score(pstr):
    # relative frequency (maximum likelihood estimate) of each term in the document
    flist = pstr.split()
    fdict = dict(nltk.FreqDist(flist))
    for key, value in fdict.items():
        fdict[key] = round(value / len(flist), 2)
    return fdict
df = clean_df(df)
qry = "reduce transmission of HIV"
qry= remove_stopwords_from_tokens(remove_punctuation(set_tokens_to_lowercase(qry.split())))
qry = " ".join(x for x in qry)
allcounts = {}
for descript in df['description']:
    # accumulate collection-wide term counts
    tmp = dict(nltk.FreqDist(descript.split()))
    for key, value in tmp.items():
        if key not in allcounts:
            allcounts[key] = value
        else:
            allcounts[key] = allcounts[key] + value
total = sum(allcounts.values())
df2=calc_docscore(df, qry)
df2sort_by_score = df2.sort_values('score', ascending=False)
print (df2sort_by_score[1:20])
## Description score
## 136 goal project contribute reduction hiv incidenc... 0.121396
## 180 sa school-based sexuality hiv prevention educa... 0.121396
## 142 hiv prevention treatment professional sex work... 0.121396
## 143 hiv prevention treatment professional sex work... 0.121396
## 360 ?improving diabetes care prevention piloting h... 0.121396
## 167 reduce hiv/aids prevalence prevention identify... 0.120252
## 9537 unops helps procure retroviral drugs hiv progr... 0.101396
## 311 pace uganda sub mildmay hiv related project fu... 0.101396
## 211 objective program increase use evidence-based ... 0.101396
## 319 ?psi subreceipent project hope usaid award eth... 0.101396
## 281 sfh south africa/psi sub tb hiv care award hts... 0.0913958
## 65 psi kenya helping 3ie build evidence towards u... 0.0913958
## 38 overall goal reduce mortality morbidity due ma... 0.0902525
## 114 expand strengthen high quality community-based... 0.0813958
## 266 goal program scale-up strengthen delivery qual... 0.0813958
## 242 increased access universal hiv prevention serv... 0.0813958
## 316 using user-centered approaches increase adopti... 0.0813958
## 149 global fund project funded zimbabwe ministry h... 0.0802525
## 249 hiv identification treatment program lesotho -... 0.0713958
Exercises:
Suppose the document collection contains two documents:
\(d_1\): Xyzzy reports a profit but revenue is down
\(d_2\): Quorus narrows quarter loss but revenue decreases further
The query is: “revenue down”
Calculate maximum likelihood estimates for terms in document 1 and document 2.
Apply linear interpolation and calculate the score for query/document 1 and query/document 2.
D1: He likes to wink, he likes to drink
D2: He likes to drink, and drink, and drink
D3: The thing he likes to drink is ink
D4: The ink he likes to drink is pink
D5: He likes to wink, and drink pink ink
Query: “drink pink ink”
Write Python code to do the following:
Elasticsearch is a real-time distributed and open source full-text search and analytics engine.
It is accessible via a RESTful web service interface and stores documents in JSON format (see example https://json.org/example.html).
It is built on the Java programming language and hence Elasticsearch can run on different platforms.
It enables users to explore very large amounts of data at very high speed.
At heart it uses an inverted index as shown in Figure X. It maps terms to documents (and possibly positions in the documents) containing the term.
An index is a collection of documents, and a shard is a subset thereof. Documents are scored using tf-idf calculations.
To minimize index sizes, various compression techniques are used, for example when storing the postings lists (which can get quite large).
Updating the index in Elasticsearch is a delete followed by a re-insertion of the document. This keeps the data structures small and compact at the cost of efficient updates.
When new documents are added (perhaps via an update), the index changes are first buffered in memory.
Eventually, the index files in their entirety, are flushed to disk.
“With the incorporation of BERT this year into the ranking and featured snippets algorithm, Google has taken a huge leap forward into making search really about intent matching rather than pure string matching”
Eli Schwartz, 2019
The vocabulary mismatch problem arises from synonymy and polysemy: the same word can have different meanings.
A search engine may not be able to guess the right meaning if an appropriate context is not provided.
IR systems are only as good as the queries provided to them.
Queries are written by humans, and the human is the weak link in this chain: a high-quality query is a must, and a sufficiently bad query can defeat any search engine.
A search query is: Windows
For this query, a search engine (such as Google) can show results of three types:
(1) The computer OS Windows
(2) Windows of buildings
(3) A combination of both (1) and (2)
It is not the intention of a search engine to provide results of type 3, i.e. combined results for the OS Windows and building windows.
A user who works for a building company probably does not want the computer OS Windows in the query results; for such users, the output should be building windows.
Similarly, another user working as a software engineer should get results about the Windows OS for the same query.
Search engines that tailor query results to a person's interests are called personalised search engines.
Providing results based on a person's interests, and ranking them accordingly, is one of the most challenging problems in Information Retrieval (IR).
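As a toy illustration of personalisation, results can be re-ranked by how much their topic terms overlap with a user's interest profile. The result titles, topic terms and profiles below are all hypothetical:

```python
# hypothetical search results for the query "Windows", tagged with topic terms
results = {
    "Windows 11 OS update improves the taskbar": {"software", "os", "update"},
    "Double-glazed windows cut building heat loss": {"building", "glazing", "construction"},
}

def personalize(results, profile):
    """Rank results by overlap between document topic terms and the user profile."""
    return sorted(results, key=lambda title: -len(results[title] & profile))

builder_profile = {"building", "construction", "glass"}
engineer_profile = {"software", "programming", "os"}

print(personalize(results, builder_profile)[0])   # building-related result first
print(personalize(results, engineer_profile)[0])  # OS-related result first
```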
Intended Learning Outcomes: By the end of Chapter 3 you will be able to:-
Machine Learning algorithms enable a machine to learn from examples. Only supervised machine learning will be investigated here. Using these examples - known as a training set - our chosen algorithm establishes a function, for example as below:
In the case of regression, the objective is to find a relationship among the input variables. Regression analysis helps in understanding how the dependent variable changes with respect to the independent variables.
In classification, the objective is to assign each input vector to one of a given number of discrete categories. An algorithm that implements classification is known as a classifier.
Text classification is a typical task in supervised machine learning (ML). Assigning categories to documents, which can be web pages, library books, media articles, galleries etc., has many applications, e.g. spam filtering, email routing and sentiment analysis.
A classification algorithm specifies which of k categories an input belongs to. To solve this task, the learning algorithm is usually asked to produce a function \(f: \mathbb{R}^n \rightarrow \{1, \ldots, k\}\). When \(y = f(x)\), the model assigns an input described by vector \(x\) to a category identified by numeric code \(y\).
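A minimal sketch of such a function f, here a nearest-centroid rule over two hypothetical classes with 2-dimensional feature vectors (the centroids are made up for illustration):

```python
# Hypothetical 2-feature example: f maps a feature vector x to a class code y.
centroids = {0: (1.0, 1.0), 1: (5.0, 5.0)}  # class code -> class centroid

def f(x):
    """Assign x to the class whose centroid is closest (a minimal classifier)."""
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(centroids, key=lambda y: dist2(x, centroids[y]))

print(f((0.8, 1.3)))  # close to centroid of class 0
print(f((4.6, 5.2)))  # close to centroid of class 1
```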
The machine must predict the most probable category, class, or label for new examples.
High-level workflow of a text classification project
One of our main concerns when developing a classification model is whether the different classes are balanced, i.e. the dataset contains an approximately equal portion of each class. For example, if we had two classes with 95% of observations belonging to one of them, a dumb classifier that always outputs the majority class would have 95% accuracy, although it would fail on every prediction of the minority class. There are several ways of dealing with imbalanced datasets. One approach is to undersample the majority class and oversample the minority one, so as to obtain a more balanced dataset. Another approach is to use error metrics beyond accuracy, such as precision, recall or the F1-score; these metrics are examined later. Looking at the BBC News data, the percentage of observations belonging to each class can be obtained.
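A sketch of the undersampling idea on hypothetical labels: keep the minority class in full and randomly sample the majority class down to the same size.

```python
import random

random.seed(0)
# hypothetical imbalanced labels: 95 of class "a", 5 of class "b"
labels = ["a"] * 95 + ["b"] * 5

def undersample(items, majority, n):
    """Keep all minority items plus a random sample of n majority items."""
    minority = [x for x in items if x != majority]
    majority_items = [x for x in items if x == majority]
    return minority + random.sample(majority_items, n)

balanced = undersample(labels, "a", 5)
print(len(balanced))  # 10 items, 5 per class
```

Oversampling would instead duplicate (or synthesise) minority examples until the classes match in size.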
It is a common practice to carry out an exploratory data analysis in order to gain some insights from the data.
import pandas as pd
import matplotlib.pyplot as plt
import pickle
##read in data
df = pd.read_csv("C:/data/BBC/News_dataset.csv", sep=';')
#look at dimensions
df.shape
## (2225, 4)
##look at top 6
df.head()
##look at data in content column
## File_Name Content Category \
## 0 001.txt Ad sales boost Time Warner profit\r\n\r\nQuart... business
## 1 002.txt Dollar gains on Greenspan speech\r\n\r\nThe do... business
## 2 003.txt Yukos unit buyer faces loan claim\r\n\r\nThe o... business
## 3 004.txt High fuel prices hit BA's profits\r\n\r\nBriti... business
## 4 005.txt Pernod takeover talk lifts Domecq\r\n\r\nShare... business
##
## Complete_Filename
## 0 001.txt-business
## 1 002.txt-business
## 2 003.txt-business
## 3 004.txt-business
## 4 005.txt-business
df['Content'].head()
## 0 Ad sales boost Time Warner profit\r\n\r\nQuart...
## 1 Dollar gains on Greenspan speech\r\n\r\nThe do...
## 2 Yukos unit buyer faces loan claim\r\n\r\nThe o...
## 3 High fuel prices hit BA's profits\r\n\r\nBriti...
## 4 Pernod takeover talk lifts Domecq\r\n\r\nShare...
## Name: Content, dtype: object
#look at the full text of the article at index 1
df.loc[1, 'Content']
## 'Dollar gains on Greenspan speech\r\n\r\nThe dollar has hit its highest level against the euro in almost three months after the Federal Reserve head said the US trade deficit is set to stabilise.\r\n\r\nAnd Alan Greenspan highlighted the US government\'s willingness to curb spending and rising household savings as factors which may help to reduce it. In late trading in New York, the dollar reached $1.2871 against the euro, from $1.2974 on Thursday. Market concerns about the deficit has hit the greenback in recent months. On Friday, Federal Reserve chairman Mr Greenspan\'s speech in London ahead of the meeting of G7 finance ministers sent the dollar higher after it had earlier tumbled on the back of worse-than-expected US jobs data. "I think the chairman\'s taking a much more sanguine view on the current account deficit than he\'s taken for some time," said Robert Sinche, head of currency strategy at Bank of America in New York. "He\'s taking a longer-term view, laying out a set of conditions under which the current account deficit can improve this year and next."\r\n\r\nWorries about the deficit concerns about China do, however, remain. China\'s currency remains pegged to the dollar and the US currency\'s sharp falls in recent months have therefore made Chinese export prices highly competitive. But calls for a shift in Beijing\'s policy have fallen on deaf ears, despite recent comments in a major Chinese newspaper that the "time is ripe" for a loosening of the peg. The G7 meeting is thought unlikely to produce any meaningful movement in Chinese policy. In the meantime, the US Federal Reserve\'s decision on 2 February to boost interest rates by a quarter of a point - the sixth such move in as many months - has opened up a differential with European rates. The half-point window, some believe, could be enough to keep US assets looking more attractive, and could help prop up the dollar. 
The recent falls have partly been the result of big budget deficits, as well as the US\'s yawning current account gap, both of which need to be funded by the buying of US bonds and assets by foreign firms and governments. The White House will announce its budget on Monday, and many commentators believe the deficit will remain at close to half a trillion dollars.'
#get all values from the Category column and put them in a list
cat = list(df['Category'])
#get all unique values
cat2 = set(cat)
#add counts to dictionary
dict1 = {}
for value in cat2:
    dict1[value] = cat.count(value)
print (dict1)
## {'entertainment': 386, 'tech': 401, 'politics': 417, 'business': 510, 'sport': 511}
#plot contents of dictionary
plt.bar(range(len(dict1)), dict1.values(), align='center')
plt.xticks(range(len(dict1)), list(dict1.keys()))
plt.show()
#length of each news article
df['News_length'] = df['Content'].str.len()
# have a look at top 6
df['News_length'].head()
## 0 2569
## 1 2257
## 2 1557
## 3 2421
## 4 1575
## Name: News_length, dtype: int64
#get basic stats on column
df['News_length'].describe()
## count 2225.000000
## mean 2274.363596
## std 1370.782663
## min 506.000000
## 25% 1454.000000
## 50% 1978.000000
## 75% 2814.000000
## max 25596.000000
## Name: News_length, dtype: float64
#how many articles with more than 10k characters
df_more10k = df[df['News_length'] > 10000]
len(df_more10k)
## 7
#95 percent of articles have a news length below the value given by this quantile
quantile_95 = df['News_length'].quantile(0.95)
print (quantile_95)
## 4304.0
df_95 = df[df['News_length'] < quantile_95]
#get only category and newslength
df_95_2 = df_95[['Category', 'News_length']]
#have a look at it
df_95_2.head()
# all categories
## Category News_length
## 0 business 2569
## 1 business 2257
## 2 business 1557
## 3 business 2421
## 4 business 1575
print (cat2)
## {'entertainment', 'tech', 'politics', 'business', 'sport'}
#make box plot
dictc = {}
for value in cat2:
    news_len = df_95_2[df_95_2['Category'] == value]
    news_len = list(news_len['News_length'])
    dictc[value] = news_len
dictc['business'][0:6]
box_plot_data=[dictc['business'],dictc['entertainment'],dictc['politics'],dictc['sport'], dictc['tech']]
plt.boxplot(box_plot_data,patch_artist=True,labels=['business', 'entertainment','politics','sport','tech'])
plt.show()
A nice explanation of box plots is available here: https://towardsdatascience.com/understanding-boxplots-5e2df7bcbd51
Before creating any feature from the raw text, we must perform a cleaning process to ensure no distortions are introduced to the model to be used.
See NLP Intro course for further explanation https://github.com/salihadfid1/NLPINTROLIVE/blob/master/Intro-to-NLP-v4.html
With ML tasks, it is possible to generate useful features in a variety of ways. For example:
Feature engineering is an essential part of building any intelligent system. As Andrew Ng says:
“Coming up with features is difficult, time-consuming, requires expert knowledge. ‘Applied machine learning’ is basically feature engineering.”
Feature engineering is the process of transforming data into features that act as inputs for machine learning models; good quality features help improve model performance.
When dealing with text data, there are several ways of obtaining features that represent the data. A few common methods are delineated below.
In order to represent our text, every row of the dataset will be a single document of the corpus. The columns (features) will differ depending on which feature creation method we choose:
Word Count Vectors
With this method, every column is a term from the corpus, and every cell represents the frequency count of each term in each document.
TF–IDF Vectors
TF-IDF is a score that represents the relative importance of a term in the document and the entire corpus.
See NLP Intro course for further explanation
https://github.com/salihadfid1/NLPINTROLIVE/blob/master/Intro-to-NLP-v4.html
These two methods (Word Count Vectors and TF-IDF Vectors) are often called the Bag of Words approach, since the order of the words in a sentence is ignored. The following methods are more advanced, as they preserve some information about word order and context.
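A quick way to see that word order is discarded: two sentences with opposite meanings produce identical bags of words.

```python
from collections import Counter

# Word order is discarded: both sentences produce the same bag of words.
s1 = "the dog bit the man"
s2 = "the man bit the dog"
bow1, bow2 = Counter(s1.split()), Counter(s2.split())
print(bow1 == bow2)  # True, despite opposite meanings
```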
Word Embeddings
The position of a word within the vector space is learned from text and is based on the words that surround the word when it is used.
See NLP Intro course for further explanation
Text-based or NLP-based features
We can manually create any feature that we think may be of importance when discerning between categories (i.e. word density, number of characters or words, etc…). We can also use NLP based features using Part of Speech models, which can tell us, for example, if a word is a noun or a verb, and then use the frequency distribution of the PoS tags.
Topic Models
Methods such as Latent Dirichlet Allocation attempt to represent every topic by a probability distribution over words, in what is known as topic modelling.
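A hedged sketch of topic modelling with scikit-learn's LatentDirichletAllocation on a toy two-theme corpus; the corpus and the choice of 2 topics are illustrative only:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus with two rough themes (sport vs business)
corpus = [
    "match win team goal score",
    "team match cup goal win",
    "market shares profit bank economy",
    "bank profit market economy growth",
]
vec = CountVectorizer()
X = vec.fit_transform(corpus)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

# Each row of components_ is an (unnormalised) distribution over the vocabulary
vocab = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
for k, topic in enumerate(lda.components_):
    top = [vocab[i] for i in topic.argsort()[-3:]]
    print(f"topic {k}: {top}")
```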
In feature selection, the aim is to identify the features that are most relevant to the class label.
TF-IDF vectors have been chosen to represent the documents in this BBC dataset corpus due to their simplicity and the speed with which the vectors can be created.
When creating the features with this method, some parameters have to be chosen:
N-gram range: unigrams, bigrams, trigrams ?
Maximum/Minimum Document Frequency: when building the vocabulary, ignore terms that have a document frequency strictly higher/lower than the given threshold.
Maximum features: Choose the top N features ordered by term frequency across the corpus.
The following parameters have been chosen:
Machine learning models require numeric features and labels to provide a prediction. A dictionary to map each label to a numerical ID has been created. This mapping scheme is as below:
A test set needs to be set up in order to prove the quality of the models when predicting on unseen data.
A random split with 85% of the observations composing the training set and 15% composing the test set will be established. The following steps are then undertaken:
import os
import re
import numpy as np
from collections import Counter
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import chi2
from sklearn.ensemble import RandomForestClassifier
from sklearn import svm
from pprint import pprint
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.model_selection import ShuffleSplit
os.chdir("C:/IR Course/Adv -IR/")
plt.style.use('ggplot')
def set_tokens_to_lowercase(data):
    for index, entry in enumerate(data):
        data[index] = entry.lower()
    return data
def remove_punctuation(data):
    # a single regex pass replaces every non-word character with a space,
    # so there is no need to loop over individual punctuation symbols
    for index, entry in enumerate(data):
        data[index] = re.sub(r'[^\w]', ' ', entry)
    return data
def remove_stopwords_from_tokens(data):
    stop_words = set(stopwords.words("english"))
    new_list = []
    for index, entry in enumerate(data):
        no_stopwords = ""
        entry = entry.split()
        for word in entry:
            if word not in stop_words:
                no_stopwords = no_stopwords + " " + word
        new_list.append(no_stopwords)
    return new_list
def lemmatiser(pdf, pcol):
    wordnet_lemmatizer = WordNetLemmatizer()
    lemmatized_text_list = []
    for row in range(len(pdf)):
        # Create an empty list containing lemmatized words
        lemmatized_list = []
        # Save the text and its words into an object
        text = pdf.loc[row, pcol]
        text_words = text.split(" ")
        # Iterate through every word to lemmatize
        for word in text_words:
            lemmatized_list.append(wordnet_lemmatizer.lemmatize(word, pos="v"))
        # Join the list
        lemmatized_text = " ".join(lemmatized_list)
        # Append to the list containing the texts
        lemmatized_text_list.append(lemmatized_text)
    return lemmatized_text_list
# remove \r and \n
df['Content_Parsed_1'] = df['Content'].str.replace("\r", " ")
df['Content_Parsed_1'] = df['Content_Parsed_1'].str.replace("\n", " ")
# Lowercasing the text
df['Content_Parsed_2'] = df['Content_Parsed_1'].str.lower()
# remove punctuation
df['Content_Parsed_3'] = pd.Series(remove_punctuation (list(df['Content_Parsed_2'])))
#remove possessive
df['Content_Parsed_4'] = df['Content_Parsed_3'].str.replace("'s", "")
df.head()
## File_Name Content Category \
## 0 001.txt Ad sales boost Time Warner profit\r\n\r\nQuart... business
## 1 002.txt Dollar gains on Greenspan speech\r\n\r\nThe do... business
## 2 003.txt Yukos unit buyer faces loan claim\r\n\r\nThe o... business
## 3 004.txt High fuel prices hit BA's profits\r\n\r\nBriti... business
## 4 005.txt Pernod takeover talk lifts Domecq\r\n\r\nShare... business
##
## Complete_Filename News_length \
## 0 001.txt-business 2569
## 1 002.txt-business 2257
## 2 003.txt-business 1557
## 3 004.txt-business 2421
## 4 005.txt-business 1575
##
## Content_Parsed_1 \
## 0 Ad sales boost Time Warner profit Quarterly...
## 1 Dollar gains on Greenspan speech The dollar...
## 2 Yukos unit buyer faces loan claim The owner...
## 3 High fuel prices hit BA's profits British A...
## 4 Pernod takeover talk lifts Domecq Shares in...
##
## Content_Parsed_2 \
## 0 ad sales boost time warner profit quarterly...
## 1 dollar gains on greenspan speech the dollar...
## 2 yukos unit buyer faces loan claim the owner...
## 3 high fuel prices hit ba's profits british a...
## 4 pernod takeover talk lifts domecq shares in...
##
## Content_Parsed_3 \
## 0 ad sales boost time warner profit quarterly...
## 1 dollar gains on greenspan speech the dollar...
## 2 yukos unit buyer faces loan claim the owner...
## 3 high fuel prices hit ba s profits british a...
## 4 pernod takeover talk lifts domecq shares in...
##
## Content_Parsed_4
## 0 ad sales boost time warner profit quarterly...
## 1 dollar gains on greenspan speech the dollar...
## 2 yukos unit buyer faces loan claim the owner...
## 3 high fuel prices hit ba s profits british a...
## 4 pernod takeover talk lifts domecq shares in...
#lemmatise
df['Content_Parsed_5'] = lemmatiser (df, 'Content_Parsed_4')
df['Content_Parsed_6'] = df['Content_Parsed_5']
#remove stopwords
df['Content_Parsed_6'] = pd.Series(remove_stopwords_from_tokens(list(df['Content_Parsed_6'])))
list_columns = ["File_Name", "Category", "Complete_Filename", "Content", "Content_Parsed_6"]
df = df[list_columns]
df = df.rename(columns={'Content_Parsed_6': 'Content_Parsed'})
print(df.loc[3,'Content_Parsed'])
## high fuel price hit ba profit british airways blame high fuel price 40 drop profit report result three months 31 december 2004 airline make pre tax profit â 75m 141m compare â 125m year earlier rod eddington ba chief executive say result respectable third quarter fuel cost rise â 106m 47 3 ba profit still better market expectation â 59m expect rise full year revenues help offset increase price aviation fuel ba last year introduce fuel surcharge passengers october increase â 6 â 10 one way long haul flight short haul surcharge raise â 2 50 â 4 leg yet aviation analyst mike powell dresdner kleinwort wasserstein say ba estimate annual surcharge revenues â 160m still way short additional fuel cost predict extra â 250m turnover quarter 4 3 â 1 97bn benefit rise cargo revenue look ahead full year result march 2005 ba warn yield average revenues per passenger expect decline continue lower price face competition low cost carriers however say sales would better previously forecast year march 2005 total revenue outlook slightly better previous guidance 3 3 5 improvement anticipate ba chairman martin broughton say ba previously forecast 2 3 rise full year revenue also report friday passenger number rise 8 1 january aviation analyst nick van den brul bnp paribas describe ba latest quarterly result pretty modest quite good revenue side show impact fuel surcharge positive cargo development however operate margins cost impact fuel strong say since 11 september 2001 attack unite state ba cut 13 000 job part major cost cut drive focus remain reduce controllable cost debt whilst continue invest products mr eddington say example take delivery six airbus a321 aircraft next month start improvements club world flat bed ba share close four pence 274 5 pence
df.head()
## File_Name Category Complete_Filename \
## 0 001.txt business 001.txt-business
## 1 002.txt business 002.txt-business
## 2 003.txt business 003.txt-business
## 3 004.txt business 004.txt-business
## 4 005.txt business 005.txt-business
##
## Content \
## 0 Ad sales boost Time Warner profit\r\n\r\nQuart...
## 1 Dollar gains on Greenspan speech\r\n\r\nThe do...
## 2 Yukos unit buyer faces loan claim\r\n\r\nThe o...
## 3 High fuel prices hit BA's profits\r\n\r\nBriti...
## 4 Pernod takeover talk lifts Domecq\r\n\r\nShare...
##
## Content_Parsed
## 0 ad sales boost time warner profit quarterly p...
## 1 dollar gain greenspan speech dollar hit highe...
## 2 yukos unit buyer face loan claim owners embat...
## 3 high fuel price hit ba profit british airways...
## 4 pernod takeover talk lift domecq share uk dri...
ADD LABELS
category_codes = {
'business': 0,
'entertainment': 1,
'politics': 2,
'sport': 3,
'tech': 4
}
# Category mapping
df['Category_Code'] = df['Category']
df = df.replace({'Category_Code':category_codes})
## ensure no other category in dataframe
for index, row in df.iterrows():
    if row['Category_Code'] not in [0, 1, 2, 3, 4]:
        df = df.drop(index)
df.tail()
## File_Name Category Complete_Filename \
## 2220 397.txt tech 397.txt-tech
## 2221 398.txt tech 398.txt-tech
## 2222 399.txt tech 399.txt-tech
## 2223 400.txt tech 400.txt-tech
## 2224 401.txt tech 401.txt-tech
##
## Content \
## 2220 BT program to beat dialler scams\r\n\r\nBT is ...
## 2221 Spam e-mails tempt net shoppers\r\n\r\nCompute...
## 2222 Be careful how you code\r\n\r\nA new European ...
## 2223 US cyber security chief resigns\r\n\r\nThe man...
## 2224 Losing yourself in online gaming\r\n\r\nOnline...
##
## Content_Parsed Category_Code
## 2220 bt program beat dialler scam bt introduce two... 4
## 2221 spam e mail tempt net shoppers computer users... 4
## 2222 careful code new european directive could put... 4
## 2223 us cyber security chief resign man make sure ... 4
## 2224 lose online game online role play game time c... 4
TRAIN - TEST SPLIT
To prove the quality of the model, a subset of the data is set apart for testing. The model is trained and its hyperparameters tuned on the training data, and performance is then measured on the unseen data of the test set.
A test set size of 15% of the full dataset is used.
X_train, X_test, y_train, y_test = train_test_split(df['Content_Parsed'],
df['Category_Code'],
test_size=0.15,
random_state=8)
PARAMETERS FOR TfidfVectorizer IN SCIKIT-LEARN
# Parameter election
ngram_range = (1,2)
min_df = 10
max_df = 1.
max_features = 300
tfidf = TfidfVectorizer(encoding='utf-8',
ngram_range=ngram_range,
stop_words=None,
lowercase=False,
max_df=max_df,
min_df=min_df,
max_features=max_features,
norm='l2',
sublinear_tf=True)
features_train = tfidf.fit_transform(X_train).toarray()
labels_train = y_train
labels_train = np.array(labels_train, dtype=int)
#training data
print(features_train.shape)
## (1891, 300)
features_test = tfidf.transform(X_test).toarray()
labels_test = y_test
labels_test = np.array(labels_test, dtype=int)
#test data
print(features_test.shape)
## (334, 300)
for category, category_id in sorted(category_codes.items()):
    features_chi2 = chi2(features_train, labels_train == category_id)
    indices = np.argsort(features_chi2[0])
    feature_names = np.array(tfidf.get_feature_names())[indices]
    unigrams = [v for v in feature_names if len(v.split(' ')) == 1]
    bigrams = [v for v in feature_names if len(v.split(' ')) == 2]
    print("# '{}' category:".format(category))
    print("  . Most correlated unigrams:\n. {}".format('\n. '.join(unigrams[-5:])))
    print("  . Most correlated bigrams:\n. {}".format('\n. '.join(bigrams[-2:])))
    print("")
## # 'business' category:
## . Most correlated unigrams:
## . firm
## . market
## . economy
## . growth
## . bank
## . Most correlated bigrams:
## . last year
## . year old
##
## # 'entertainment' category:
## . Most correlated unigrams:
## . tv
## . music
## . star
## . award
## . film
## . Most correlated bigrams:
## . mr blair
## . prime minister
##
## # 'politics' category:
## . Most correlated unigrams:
## . minister
## . blair
## . election
## . party
## . labour
## . Most correlated bigrams:
## . prime minister
## . mr blair
##
## # 'sport' category:
## . Most correlated unigrams:
## . win
## . side
## . game
## . team
## . match
## . Most correlated bigrams:
## . say mr
## . year old
##
## # 'tech' category:
## . Most correlated unigrams:
## . digital
## . technology
## . computer
## . software
## . users
## . Most correlated bigrams:
## . year old
## . say mr
Once feature vectors are built, machine learning classification models are used to find the one that performs best on the data. The following model will be used:
The methodology used to train each model is as follows:
svc_0= svm.SVC(C=0.1, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
max_iter=-1, probability=True, random_state=8, shrinking=True, tol=0.001,
verbose=False)
print('Parameters currently in use:\n')
## Parameters currently in use:
print(svc_0.get_params())
## {'C': 0.1, 'cache_size': 200, 'class_weight': None, 'coef0': 0.0, 'decision_function_shape': 'ovr', 'degree': 3, 'gamma': 'auto', 'kernel': 'linear', 'max_iter': -1, 'probability': True, 'random_state': 8, 'shrinking': True, 'tol': 0.001, 'verbose': False}
svc_0.fit(features_train, labels_train)
## SVC(C=0.1, cache_size=200, class_weight=None, coef0=0.0,
## decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
## max_iter=-1, probability=True, random_state=8, shrinking=True, tol=0.001,
## verbose=False)
svc_pred = svc_0.predict(features_test)
A useful overview of classification evaluation metrics: https://towardsdatascience.com/the-5-classification-evaluation-metrics-you-must-know-aa97784ff226
print("The training accuracy is: ")
## The training accuracy is:
print(accuracy_score(labels_train, svc_0.predict(features_train)))
## 0.960338445267
print("Classification report")
## Classification report
print(classification_report(labels_test,svc_pred))
## precision recall f1-score support
##
## 0 0.87 0.98 0.92 81
## 1 0.96 0.96 0.96 49
## 2 0.97 0.88 0.92 72
## 3 0.99 0.99 0.99 72
## 4 0.93 0.88 0.91 60
##
## avg / total 0.94 0.94 0.94 334
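The confusion_matrix function imported earlier complements this report by showing which classes are confused with which. A self-contained illustration on toy labels (not the BBC results):

```python
from sklearn.metrics import confusion_matrix

# Toy true/predicted labels over 3 classes; rows = true class, columns = predicted.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(y_true, y_pred)
print(cm)
# Diagonal entries are correct predictions; off-diagonal entries are confusions.
```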
Using the SpamHam dataset:-
Intended Learning Outcomes
1. Describe sentiment analysis.
2. Describe how sentiment analysis could be used to determine sentiment around a new product.
3. Give examples where sentiment in language may prove to be challenging to detect.
4. List advantages of sentiment analysis systems.
5. Describe how sentiment analysis can be undertaken.
6. Identify the main steps involved in building a machine learning based sentiment classifier.
7. Take a dataset and build a supervised machine learning sentiment analysis classifier in Python.
Sentiment analysis is a text classification task. Given a phrase, or a list of phrases, the classifier should indicate whether the phrase is positive, negative or neutral.
Sentiment Analysis systems typically identify the following attributes of an expression:
Polarity: whether the speaker expresses a positive or negative opinion.
Subject: the thing that is being talked about.
Opinion holder: the person, or entity that expresses the opinion.
In essence: “It is the process of determining the emotional tone behind a series of words, used to gain an understanding of the attitudes, opinions and emotions expressed within an online mention” (Bannister, 2018).
Used in:
Sentiment analysis systems allow companies to make sense of the sea of unstructured text by automating business processes, getting actionable insights, and saving hours of manual data processing.
Sentiment analysis can be applied at different levels of scope:
Document level sentiment analysis obtains the sentiment of a complete document or paragraph.
Sentence level sentiment analysis obtains the sentiment of a single sentence.
Sub-sentence level sentiment analysis obtains the sentiment of sub-expressions within a sentence.
There are many types and flavours of sentiment analysis systems, ranging from systems that focus on polarity (positive, negative, neutral) to systems that detect feelings and emotions (angry, happy, sad, etc.) or identify intentions (e.g. interested v. not interested).
For example, aspect-based sentiment analysis indicates sentiment towards different features of a product.
Another example is intent analysis, which detects what people want to do with a text rather than what they say in it. Look at the following examples:
“Your customer support is a disaster. I’ve been on hold for 20 minutes”.
“I would like to know how to replace the cartridge”.
“Can you help me fill out this form?”
A human being has no problem detecting the complaint in the first text, the question in the second, and the request in the third. However, machines can have problems identifying them. Sometimes the intended action can be inferred from the text, but sometimes inferring it requires contextual knowledge.
Human language is complex. Teaching a machine to analyse the various grammatical nuances, cultural variations, slang and misspellings that occur in online mentions is a difficult process. Teaching a machine to understand how context can affect tone is even more difficult.
Contextual understanding
Contextual understanding is pivotal for accurate sentiment detection.
For example: “I am craving McDonald’s so bad”.
Most systems will misinterpret this statement as negative on seeing the phrase “so bad”.
Sentiment Ambiguity
“Can you recommend any good holiday destinations?”
This statement doesn’t express any sentiment, although it uses the positive sentiment word “good”
Sarcasm
“This phone has an awesome battery back-up of 2 hours.”
This statement is actually negative, even though it contains the positive word “awesome”.
Comparatives
“Iphone is much better than Samsung.”
Most sentiment analysis tools cannot “pick sides” when they find comparative statements like the one above; they can only pick the sentiment based on keywords.
Scalability:
There is simply too much data to process manually. Sentiment analysis allows data to be processed at scale in an efficient and cost-effective way.
Real-time analysis:
Sentiment analysis can identify critical information in real time, enabling situational awareness during specific scenarios.
Consistent criteria:
By using a centralized sentiment analysis system, companies can apply the same criteria to all of their data. This helps to reduce errors and improve data consistency.
There are many methods and algorithms to implement sentiment analysis systems, which can be classified as:
Rule-based systems that perform sentiment analysis based on a set of manually crafted rules.
Automatic systems that rely on machine learning techniques to learn from data.
Hybrid systems that combine both rule based and automatic approaches.
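A minimal sketch of the rule-based flavour: a lexicon scorer with a crude negation rule. The word lists are tiny, hypothetical stand-ins for a real sentiment lexicon:

```python
# Hypothetical lexicon; real systems use lexicons with thousands of entries.
POSITIVE = {"good", "great", "awesome", "love"}
NEGATIVE = {"bad", "terrible", "disaster", "hate"}

def lexicon_sentiment(text):
    """Score text by counting lexicon hits; 'not'/'never' flips the next word."""
    tokens = text.lower().replace(".", "").replace(",", "").split()
    score = 0
    for i, tok in enumerate(tokens):
        weight = -1 if i > 0 and tokens[i - 1] in {"not", "never"} else 1
        if tok in POSITIVE:
            score += weight
        elif tok in NEGATIVE:
            score -= weight
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(lexicon_sentiment("The support was great"))                # positive
print(lexicon_sentiment("Your customer support is a disaster"))  # negative
print(lexicon_sentiment("not good at all"))                      # negative
```

Sarcasm and context (the challenges above) are exactly what such rules miss, which motivates the machine learning approach that follows.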
Below are the steps typically undertaken in building an automatic sentiment analysis system.
In a more practical sense, the objective here is to take text and produce a label (or labels) that summarises its sentiment, e.g. positive, neutral or negative.
To solve this problem, a typical machine learning pipeline is followed.
Import the Dataset
airline_tweets = pd.read_csv("C:/data/tweets/Tweets.csv")
airline_tweets.head()
## tweet_id airline_sentiment airline_sentiment_confidence \
## 0 570306133677760513 neutral 1.0000
## 1 570301130888122368 positive 0.3486
## 2 570301083672813571 neutral 0.6837
## 3 570301031407624196 negative 1.0000
## 4 570300817074462722 negative 1.0000
##
## negativereason negativereason_confidence airline \
## 0 NaN NaN Virgin America
## 1 NaN 0.0000 Virgin America
## 2 NaN NaN Virgin America
## 3 Bad Flight 0.7033 Virgin America
## 4 Can't Tell 1.0000 Virgin America
##
## airline_sentiment_gold name negativereason_gold retweet_count \
## 0 NaN cairdin NaN 0
## 1 NaN jnardino NaN 0
## 2 NaN yvonnalynn NaN 0
## 3 NaN jnardino NaN 0
## 4 NaN jnardino NaN 0
##
## text tweet_coord \
## 0 @VirginAmerica What @dhepburn said. NaN
## 1 @VirginAmerica plus you've added commercials t... NaN
## 2 @VirginAmerica I didn't today... Must mean I n... NaN
## 3 @VirginAmerica it's really aggressive to blast... NaN
## 4 @VirginAmerica and it's a really big bad thing... NaN
##
## tweet_created tweet_location user_timezone
## 0 2015-02-24 11:35:52 -0800 NaN Eastern Time (US & Canada)
## 1 2015-02-24 11:15:59 -0800 NaN Pacific Time (US & Canada)
## 2 2015-02-24 11:15:48 -0800 Lets Play Central Time (US & Canada)
## 3 2015-02-24 11:15:36 -0800 NaN Pacific Time (US & Canada)
## 4 2015-02-24 11:14:45 -0800 NaN Pacific Time (US & Canada)
Exploratory Data Analysis
from sklearn.decomposition import PCA
from sklearn import feature_extraction
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
airline_tweets = pd.read_csv("C:/data/tweets/Tweets.csv")
airline_tweets.airline.value_counts().plot(kind='pie', autopct='%1.0f%%')
plt.show()
airline_tweets.airline_sentiment.value_counts().plot(kind='pie', autopct='%1.0f%%', colors=["red", "yellow", "green"])
plt.show()
airl = list(set(airline_tweets['airline']))
pos = []
neg = []
neut = []
for airlines in airl:
    # Count negative tweets for this airline
    mask = (airline_tweets['airline'] == airlines) & (airline_tweets['airline_sentiment'] == 'negative')
    neg.append(len(airline_tweets[mask]))
    # Count positive tweets for this airline
    mask = (airline_tweets['airline'] == airlines) & (airline_tweets['airline_sentiment'] == 'positive')
    pos.append(len(airline_tweets[mask]))
    # Count neutral tweets for this airline
    mask = (airline_tweets['airline'] == airlines) & (airline_tweets['airline_sentiment'] == 'neutral')
    neut.append(len(airline_tweets[mask]))
data = [pos, neg, neut]
fig, ax = plt.subplots()
ax.set_xticks(range(len(airl)))
ax.set_xticklabels(airl, rotation='vertical')
X = np.arange(len(airl))
ax.bar(X + 0.00, data[0], color='b', width=0.25)
ax.bar(X + 0.25, data[1], color='g', width=0.25)
ax.bar(X + 0.50, data[2], color='r', width=0.25)
plt.show()
Further help on plotting pie charts in Python: https://pythonspot.com/matplotlib-pie-chart/
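As an aside, the per-airline counting loop above can be written more compactly with pandas' crosstab. A minimal sketch on a made-up two-column frame (the toy data below is invented purely for illustration):

```python
import pandas as pd

# Made-up stand-in for airline_tweets, with the two columns used above
airline_tweets = pd.DataFrame({
    "airline": ["Virgin America", "Virgin America", "United", "United", "United"],
    "airline_sentiment": ["neutral", "positive", "negative", "negative", "neutral"],
})

# One row per airline, one column per sentiment label, each cell a tweet count
counts = pd.crosstab(airline_tweets["airline"], airline_tweets["airline_sentiment"])
print(counts)
# A grouped bar chart then follows directly: counts.plot(kind="bar"); plt.show()
```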
Data Cleaning
Tweets contain many slang words and punctuation marks. The tweets will have to be cleaned before they can be used for training the machine learning model.
Before cleaning the tweets, the dataset should be split into feature and label sets.
The feature set will consist of the tweets only. The label set will consist of the sentiment of each tweet, which we have to predict. The tweet text is in the 11th column (index 10). The sentiment of the tweet is in the second column (index 1). To create the feature and label sets, we can use the iloc method of the pandas data frame.
features = airline_tweets.iloc[:, 10].values
labels = airline_tweets.iloc[:, 1].values
print(features[1:5])
## ["@VirginAmerica plus you've added commercials to the experience... tacky."
## "@VirginAmerica I didn't today... Must mean I need to take another trip!"
## '@VirginAmerica it\'s really aggressive to blast obnoxious "entertainment" in your guests\' faces & they have little recourse'
## "@VirginAmerica and it's a really big bad thing about it"]
print(labels[1:5])
## ['positive' 'neutral' 'negative' 'negative']
Now, the tweets should be cleaned. Functions developed in the introduction to NLP could be used; however, knowing how to use regular expressions is a key skill in NLP, so they will be used here to clean the text. Further information on how to use regular expressions in Python can be found here: https://stackabuse.com/using-regex-for-text-manipulation-in-python/
import re

processed_features = []
for sentence in range(0, len(features)):
    # Remove all the special characters
    processed_feature = re.sub(r'\W', ' ', str(features[sentence]))
    # Remove all single characters
    processed_feature = re.sub(r'\s+[a-zA-Z]\s+', ' ', processed_feature)
    # Remove single characters from the start
    processed_feature = re.sub(r'^[a-zA-Z]\s+', ' ', processed_feature)
    # Substitute multiple spaces with a single space
    processed_feature = re.sub(r'\s+', ' ', processed_feature, flags=re.I)
    # Remove a prefixed 'b' (left over from byte strings)
    processed_feature = re.sub(r'^b\s+', '', processed_feature)
    # Convert to lowercase
    processed_feature = processed_feature.lower()
    processed_features.append(processed_feature)
print(processed_features[12])
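To see what these substitutions actually do, the same steps can be traced on a single example (the tweet below is taken from the dataset preview above; the final strip() is added here only to tidy the output):

```python
import re

# One tweet from the dataset preview above, traced through the same steps
raw = "@VirginAmerica I didn't today... Must mean I need to take another trip!"
s = re.sub(r'\W', ' ', raw)            # special characters -> spaces
s = re.sub(r'\s+[a-zA-Z]\s+', ' ', s)  # drop single characters ("I", stray "t")
s = re.sub(r'^[a-zA-Z]\s+', ' ', s)    # drop a single character at the start
s = re.sub(r'\s+', ' ', s)             # collapse runs of whitespace
s = s.lower().strip()                  # lowercase (strip() added for tidiness)
print(s)  # virginamerica didn today must mean need to take another trip
```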
Representing Text in Numeric Form
To make statistical algorithms work with text, we first have to convert text to numbers. See Intro to NLP, section on vectorisation.
The tweets will be scored using the TF-IDF mechanism.
Documents are not normally written in a jumbled way: the sequence of words in a document is critical. But in this context of sentiment classification, word order is not treated as important. What matters most here is which words are present.
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(max_features=2500, min_df=7, max_df=0.8, stop_words=stopwords.words('english'))
processed_features = vectorizer.fit_transform(processed_features).toarray()
In the code above, max_features is set to 2500, meaning only the 2500 most frequently occurring words are used to build the bag-of-words feature vectors. Words that occur less frequently are not very useful for classification.
Similarly, max_df=0.8 specifies that only words occurring in at most 80% of the documents are used; words that occur in nearly all documents are too common to discriminate between classes. Finally, min_df is set to 7, which means a word must occur in at least 7 documents to be included.
Dividing Data into Training and Test Sets
Before we train our algorithms, we need to divide our data into training and testing sets. The training set will be used to train the algorithm, while the test set will be used to evaluate the performance of the machine learning model.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(processed_features, labels, test_size=0.2, random_state=0)
The train_test_split function from the sklearn.model_selection module divides our data into training and testing sets. It takes the feature set as the first parameter, the label set as the second parameter, and a value for the test_size parameter. A test_size of 0.2 means the dataset is split into two sets of 80% and 20% of the data: the 80% portion is used for training and the 20% portion for testing.
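A quick sketch with dummy data (the arrays below are placeholders, not the tweet features) confirms the 80/20 split sizes:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 dummy rows of 2 features each, with made-up labels
X = np.arange(200).reshape(100, 2)
y = np.array(["pos", "neg"] * 50)

# Same call shape as in the course code: 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
print(len(X_train), len(X_test))  # 80 20
```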
Training the Model
The Random Forest algorithm will be used due to its ability to act upon non-normalized data.
The sklearn.ensemble module contains the RandomForestClassifier class that can be used to train the machine learning model using the random forest algorithm.
First, a RandomForestClassifier is instantiated; its fit method is then called with the training features and labels as parameters.
from sklearn.ensemble import RandomForestClassifier
text_classifier = RandomForestClassifier(n_estimators=200, random_state=0)
text_classifier.fit(X_train, y_train)
## RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
## max_depth=None, max_features='auto', max_leaf_nodes=None,
## min_impurity_decrease=0.0, min_impurity_split=None,
## min_samples_leaf=1, min_samples_split=2,
## min_weight_fraction_leaf=0.0, n_estimators=200, n_jobs=1,
## oob_score=False, random_state=0, verbose=0, warm_start=False)
Making Predictions and Evaluating the Model
Once the model has been trained, the final step is to make predictions with it.
To do so, the predict method is called on the RandomForestClassifier object that was used for training:
predictions = text_classifier.predict(X_test)
Finally, to evaluate the performance of the machine learning model, classification metrics such as the confusion matrix, F1 score and accuracy are employed, as shown below.
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
print(confusion_matrix(y_test,predictions))
## [[1723 108 39]
## [ 326 248 40]
## [ 132 58 254]]
print(classification_report(y_test,predictions))
## precision recall f1-score support
##
## negative 0.79 0.92 0.85 1870
## neutral 0.60 0.40 0.48 614
## positive 0.76 0.57 0.65 444
##
## avg / total 0.75 0.76 0.74 2928
print(accuracy_score(y_test, predictions))
## 0.759904371585
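As a sanity check, the accuracy score can be recomputed by hand from the confusion matrix printed above: correct predictions sit on the diagonal, so accuracy is the diagonal sum divided by the total number of test examples.

```python
import numpy as np

# Confusion matrix reported above (rows = true class, columns = predicted class)
cm = np.array([[1723, 108, 39],
               [326, 248, 40],
               [132, 58, 254]])

# Accuracy = correctly classified (diagonal) / all test examples
accuracy = np.trace(cm) / cm.sum()
print(round(accuracy, 4))  # 0.7599
```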
In this section, several Python libraries were combined to perform sentiment analysis. An analysis was done on public tweets regarding six US airlines, and the classifier achieved an accuracy of around 75%.
As an exercise, repeat the analysis using the IMDB movie reviews dataset for sentiment analysis, available here: https://www.kaggle.com/oumaimahourrane/imdb-reviews
References:
https://medium.com/seek-blog/your-guide-to-sentiment-analysis-344d43d225a7
https://stackabuse.com/python-for-nlp-sentiment-analysis-with-scikit-learn/
https://monkeylearn.com/sentiment-analysis/
Maas, A., Daly, R., Pham, P., Huang, D., Ng, A. and Potts, C. (2011). Learning Word Vectors for Sentiment Analysis. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. Portland, Oregon, USA: Association for Computational Linguistics, pp. 142–150. Available at: http://www.aclweb.org/anthology/P11-1015.